Classifying spam emails using agglomerative hierarchical clustering and a topic-based approach
Abstract
Spam emails are unsolicited, annoying and sometimes harmful messages which may contain malware, phishing or hoaxes. Unlike most studies, which address the design of efficient anti-spam filters, we approach the spam email problem from a different, novel perspective. Focusing on the needs of cybersecurity units, we follow a topic-based approach for addressing the classification of spam emails into multiple categories. We propose SPEMC-15K-E and SPEMC-15K-S, two datasets with approximately 15K emails each in English and Spanish, respectively, and label them using agglomerative hierarchical clustering into 11 classes. We evaluate 16 pipelines, combining four text representation techniques: Term Frequency-Inverse Document Frequency (TF-IDF), Bag of Words, Word2Vec and BERT, with four classifiers: Support Vector Machine, Naïve Bayes (NB), Random Forest and Logistic Regression (LR). Experimental results show that the highest performance on the English dataset is achieved with TF-IDF and LR, with an F1 score of 0.953 and an accuracy of 94.6%, while on the Spanish dataset NB yields an F1 score of 0.945 and 98.5% accuracy. Regarding processing time, the fastest pipeline classifies an email in 2 ms and 2.2 ms on average for English and Spanish, respectively.
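The labelling idea in the abstract, grouping emails by topic with agglomerative hierarchical clustering over a text representation, can be illustrated with a minimal sketch. The abstract does not state which features, distance or linkage the authors used for labelling, so TF-IDF vectors, cosine distance and average linkage are assumptions here, and the four toy emails and the function names (`tfidf_vectors`, `cosine_distance`, `agglomerative`) are hypothetical. It uses only the Python standard library and is not the authors' implementation.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one L2-normalised TF-IDF dict per document (smoothed IDF)."""
    tokenised = [doc.lower().split() for doc in docs]
    n_docs = len(tokenised)
    # Document frequency: number of documents containing each term.
    df = Counter(term for tokens in tokenised for term in set(tokens))
    vectors = []
    for tokens in tokenised:
        tf = Counter(tokens)
        vec = {
            term: (count / len(tokens))
            * (math.log((1 + n_docs) / (1 + df[term])) + 1)
            for term, count in tf.items()
        }
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        vectors.append({term: w / norm for term, w in vec.items()})
    return vectors


def cosine_distance(a, b):
    # Vectors are unit length, so the dot product is the cosine similarity.
    return 1.0 - sum(w * b.get(term, 0.0) for term, w in a.items())


def agglomerative(vectors, n_clusters):
    """Average-linkage agglomerative clustering; returns one label per document."""
    # Start bottom-up: every document is its own cluster.
    clusters = [[i] for i in range(len(vectors))]
    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Average pairwise distance between the two clusters.
                d = sum(
                    cosine_distance(vectors[p], vectors[q])
                    for p in clusters[i]
                    for q in clusters[j]
                )
                d /= len(clusters[i]) * len(clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        # Merge the closest pair of clusters.
        clusters[i] += clusters.pop(j)
    labels = [0] * len(vectors)
    for label, members in enumerate(clusters):
        for member in members:
            labels[member] = label
    return labels


# Hypothetical toy emails, not taken from SPEMC-15K.
emails = [
    "win a free lottery prize now",
    "claim your free lottery prize today",
    "cheap pills from online pharmacy",
    "online pharmacy offers cheap pills",
]
labels = agglomerative(tfidf_vectors(emails), n_clusters=2)
print(labels)
```

On this toy input the two lottery emails end up in one cluster and the two pharmacy emails in the other. The naive pairwise search is far too slow for corpora of around 15K emails; at that scale an optimised library implementation would be the practical choice.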
Similar resources
Implementing Agglomerative hierarchical clustering using multiple attribute
An agglomerative hierarchical clustering algorithm is used with a top-down approach. It is implemented with multiple attributes, for which a frequency calculation is allocated. Memory requirements are lower in this process. Hierarchical clustering produces more accurate results than any other algorithm. The process is also much less time-consuming.
Competence maps using agglomerative hierarchical clustering
Knowledge management from a strategic planning point of view often requires having an accurate understanding of a firm’s or a nation’s competences in a given technological discipline. Knowledge maps have been used for the purpose of discovering the location, ownership and value of intellectual assets. The purpose of this article is to develop a new method for assessing national and firm-level co...
Modern hierarchical, agglomerative clustering algorithms
This paper presents algorithms for hierarchical, agglomerative clustering which perform most efficiently in the general-purpose setup that is given in modern standard software. Requirements are: (1) the input data is given by pairwise dissimilarities between data points, but extensions to vector data are also discussed (2) the output is a “stepwise dendrogram”, a data structure which is shared ...
Divisive Hierarchical Clustering with K-means and Agglomerative Hierarchical Clustering
This work implements a divisive hierarchical clustering algorithm with K-means and applies agglomerative hierarchical clustering to the resulting data, aiming at efficient and accurate results in data mining. The initial k centroids are found in a fixed manner instead of being chosen randomly. The k centroids are chosen by dividing the one dimensional data of a particular clu...
Clustering Acoustic Segments Using Multi-Stage Agglomerative Hierarchical Clustering
Agglomerative hierarchical clustering becomes infeasible when applied to large datasets due to its O(N^2) storage requirements. We present a multi-stage agglomerative hierarchical clustering (MAHC) approach aimed at large datasets of speech segments. The algorithm is based on an iterative divide-and-conquer strategy. The data is first split into independent subsets, each of which is clustered se...
Journal
Journal title: Applied Soft Computing
Year: 2023
ISSN: 1568-4946, 1872-9681
DOI: https://doi.org/10.1016/j.asoc.2023.110226